In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sn

sn.set_style("darkgrid")

Importing Datasets

In [2]:
red_wine_df=pd.read_csv('winequality-red.csv', delimiter=';')
In [112]:
white_wine_df=pd.read_csv('winequality-white.csv', delimiter=';')
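
Both files are read with delimiter=';' because the fields are separated by semicolons rather than commas. The following is only a small sketch of what that argument changes: the in-memory text and the name sample_csv are illustrative assumptions, using a reduced set of columns with values taken from the first rows of the red wine data.

```python
import io
import pandas as pd

# Illustrative in-memory sample (not the real file): a few columns only.
sample_csv = (
    "fixed acidity;volatile acidity;pH;alcohol;quality\n"
    "7.4;0.70;3.51;9.4;5\n"
    "7.8;0.88;3.20;9.8;5\n"
)

# With the semicolon delimiter each field becomes its own column.
parsed = pd.read_csv(io.StringIO(sample_csv), delimiter=";")

# With the default comma separator every line lands in a single column.
unparsed = pd.read_csv(io.StringIO(sample_csv))

print(parsed.shape)    # (2, 5)
print(unparsed.shape)  # (2, 1)
```

A shape with a single column right after loading is usually the first sign that the delimiter is wrong.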

Red Wine Dataset Columns

In [4]:
red_wine_df.columns
Out[4]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

White Wine Dataset Columns

In [5]:
white_wine_df.columns
Out[5]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

Columns Description¶

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):¶

12 - quality (score between 0 and 10)

Exploratory Data Analysis 📊¶

A) Red Wine Analysis 🍷¶

Showing the first five rows

In [6]:
red_wine_df.head()
Out[6]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Showing the dataset shape

In [7]:
red_wine_df.shape
Out[7]:
(1599, 12)

Showing the total information

In [8]:
red_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Showing the null values

In [9]:
red_wine_df.isnull().sum()
Out[9]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Showing the duplicates

In [10]:
red_wine_df.duplicated().sum()
Out[10]:
240
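
The notebook only counts the duplicated rows. If one decided to remove them, a possible sketch is shown here, using the five rows printed by head() above (reduced to four columns for brevity; the names head_rows and deduplicated are illustrative). Whether dropping is appropriate is a judgment call, since identical measurements could still come from different wines.

```python
import pandas as pd

# The five rows shown by red_wine_df.head(), reduced to four columns.
head_rows = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
        "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
        "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
        "quality": [5, 5, 5, 6, 5],
    }
)

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(head_rows.duplicated().sum())

# drop_duplicates() returns a new frame and leaves head_rows unchanged.
deduplicated = head_rows.drop_duplicates().reset_index(drop=True)

print(n_duplicates)        # 1
print(deduplicated.shape)  # (4, 4)
```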

Descriptive Statistics

In [11]:
red_wine_df.describe()
Out[11]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

a) Labeled Column¶

In [12]:
red_wine_df['quality'].value_counts().plot(kind='bar', figsize=(10,6), color=sn.color_palette('viridis'))
Out[12]:
<AxesSubplot: >

b) Let's check which of the columns are highly correlated with quality using a pairplot¶

In [111]:
sn.pairplot(red_wine_df, hue='quality')
Out[111]:
<seaborn.axisgrid.PairGrid at 0x240c2e5b790>

c) Visualizing the correlations between the variables¶

In [14]:
plt.figure(figsize=(12,8))

sn.heatmap(red_wine_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
Out[14]:
<AxesSubplot: >

Visualizing the correlations with respect to the labeled column¶

Observations¶

  • Free sulfur dioxide and total sulfur dioxide have some positive correlation with residual sugar.
  • Density has a positive correlation with fixed acidity and residual sugar.
  • Density has a negative correlation with alcohol and pH.
  • Quality has a positive correlation with alcohol, citric acid and sulphates, and a negative correlation with volatile acidity. We need to explore this further.
  • Fixed acidity has a high positive correlation with citric acid and density, and a negative correlation with pH.
  • Residual sugar has a positive correlation with citric acid.
  • pH has a negative correlation with fixed acidity and citric acid, but a positive correlation with volatile acidity.
In [15]:
red_wine_df.corr()['quality'].plot(kind='bar', figsize=(15,8))
Out[15]:
<AxesSubplot: >
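
The bar chart of corr()['quality'] also includes the trivial correlation of quality with itself. A minimal sketch of a tidier variant drops that entry and sorts the rest; the frame below holds only the first five rows of the red wine data for three columns, so the numbers it prints are illustrative and not meaningful.

```python
import pandas as pd

# First five rows of the red wine data, a few columns only (illustrative).
sample = pd.DataFrame(
    {
        "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
        "citric acid": [0.00, 0.00, 0.04, 0.56, 0.00],
        "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
        "quality": [5, 5, 5, 6, 5],
    }
)

quality_corr = (
    sample.corr()["quality"]
    .drop("quality")   # remove the self-correlation of 1.0
    .sort_values()     # most negative first, most positive last
)
print(quality_corr)
```

Calling .plot(kind='bar') on the sorted series gives the same chart as above with the bars ordered from the most negative to the most positive correlation.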

d) Showing the Distribution of Alcohol Column¶

In [16]:
plt.figure(figsize=(10,6))

sn.histplot(red_wine_df['alcohol'], kde=True)
Out[16]:
<AxesSubplot: xlabel='alcohol', ylabel='Count'>

As you can see, the alcohol content is positively skewed¶

Skewness

In [17]:
from scipy.stats import skew
skew(red_wine_df['alcohol'])
Out[17]:
0.8600210646566755

Mean

In [18]:
red_wine_df['alcohol'].mean()
Out[18]:
10.422983114446529

Median

In [19]:
red_wine_df['alcohol'].median()
Out[19]:
10.2
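
The mean (about 10.42) lying above the median (10.2) is consistent with the positive skewness of about 0.86 computed above. The sketch below shows the same three quantities on a small hypothetical sample with one long right tail; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical alcohol values with a long right tail, for illustration only.
alcohol = pd.Series([8.4, 9.4, 9.5, 9.8, 10.2, 11.1, 14.9])

summary = {
    "mean": alcohol.mean(),
    "median": alcohol.median(),
    # pandas applies a sample-size correction, so this differs slightly
    # from scipy.stats.skew with its default bias=True.
    "skew": alcohol.skew(),
}
print(summary)
```

In a right-skewed distribution the few large values pull the mean upwards while the median stays put, which is why the mean ends up higher.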

Let's see how alcohol varies w.r.t quality¶

To hide the outliers, we pass showfliers=False

In [134]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='alcohol', data=red_wine_df, hue='quality', legend=False, showfliers=False, palette='dark')
Out[134]:
<AxesSubplot: xlabel='quality', ylabel='alcohol'>

e) Correlation with Alcohol and pH¶

In [22]:
# jointplot creates its own figure, so its size is set with height= instead of plt.figure()
sn.jointplot(x='alcohol', y='pH', data=red_wine_df, kind='reg', height=8)
Out[22]:
<seaborn.axisgrid.JointGrid at 0x240b49f70a0>

It's a weak positive correlation.

In [24]:
from scipy.stats import pearsonr

correlation_coefficient, p_value = pearsonr(red_wine_df['alcohol'], red_wine_df['pH'])
print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: 0.20563250850549833
P-value: 9.964497741457687e-17
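
For reference, the Pearson coefficient is the covariance of the two columns divided by the product of their standard deviations. The following NumPy sketch spells that out. The helper name pearson_r is illustrative, and the inputs are just the first five alcohol and pH values of the red wine data, far too few to say anything about the full dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# First five alcohol and pH values of the red wine data (illustrative only).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]
ph = [3.51, 3.20, 3.26, 3.16, 3.51]

r = pearson_r(alcohol, ph)
print(r)
```

The result always lies between -1 and 1 and should match np.corrcoef(alcohol, ph)[0, 1] up to rounding.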

f) Correlation between alcohol and density¶

In [26]:
# jointplot creates its own figure, so its size is set with height= instead of plt.figure()
sn.jointplot(x='alcohol', y='density', data=red_wine_df, kind='reg', height=8)
Out[26]:
<seaborn.axisgrid.JointGrid at 0x240b6e41e70>

As you can see, it's a negative correlation.¶

In [27]:
correlation_coefficient, p_value = pearsonr(red_wine_df['alcohol'], red_wine_df['density'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.4961797702417019
P-value: 3.9388353399870764e-100
In [28]:
g=sn.FacetGrid(red_wine_df, col='quality')
g=g.map(sn.regplot, 'density','alcohol')

As the quality of the wine increases, the correlation between alcohol and density tends to be negative.¶

g) Let's Analyze sulphates and Quality¶

In [133]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='sulphates', data=red_wine_df, hue='quality', legend=False, showfliers=False, palette='magma')
Out[133]:
<AxesSubplot: xlabel='quality', ylabel='sulphates'>

As you can see, the sulphates increase as the quality improves, which indicates a positive correlation.¶
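
A trend read off a box plot can be cross-checked numerically, because the centre line of each box is the group median. The sketch below does this with a hypothetical sample; the sulphate values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample, for illustration only (not the real data).
sample = pd.DataFrame(
    {
        "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
        "sulphates": [0.56, 0.58, 0.60, 0.62, 0.64, 0.66, 0.72, 0.74, 0.76],
    }
)

# The median per quality level corresponds to the centre line of each box.
median_by_quality = sample.groupby("quality")["sulphates"].median()
print(median_by_quality)
```

On the real data the same call would be red_wine_df.groupby('quality')['sulphates'].median().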

In [132]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='total sulfur dioxide', data=red_wine_df, hue='quality', legend=False, showfliers=False, palette='colorblind')
Out[132]:
<AxesSubplot: xlabel='quality', ylabel='total sulfur dioxide'>
In [131]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='free sulfur dioxide', data=red_wine_df, hue='quality', legend=False, showfliers=False, palette='Set3')
Out[131]:
<AxesSubplot: xlabel='quality', ylabel='free sulfur dioxide'>

h) Let's move on to fixed acidity, volatile acidity and citric acid¶

In [130]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='fixed acidity', data=red_wine_df, hue='quality', legend=False, palette='Set2')
Out[130]:
<AxesSubplot: xlabel='quality', ylabel='fixed acidity'>
In [129]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='citric acid', data=red_wine_df, hue='quality', legend=False, palette='husl')
Out[129]:
<AxesSubplot: xlabel='quality', ylabel='citric acid'>

This indicates a positive relationship with quality: wines with more citric acid tend to be rated higher.¶

In [128]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='volatile acidity', data=red_wine_df, hue='quality', legend=False, palette='rainbow')
Out[128]:
<AxesSubplot: xlabel='quality', ylabel='volatile acidity'>

This indicates a negative relationship: the higher the volatile acidity, the lower the wine tends to be rated.¶

i) Trends between other columns¶

In [39]:
red_wine_df.columns
Out[39]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

j) Correlation between pH and Volatile Acidity¶

In [40]:
correlation_coefficient, p_value = pearsonr(red_wine_df['pH'], red_wine_df['volatile acidity'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: 0.23493729440739328
P-value: 1.7189939570061834e-21

As you can see, volatile acidity shows only a weak correlation with pH. Volatile acidity is mostly acetic acid, which is a weak acid.¶

k) Create a new column Total Acidity¶

In [41]:
red_wine_df['total acidity']=(red_wine_df['fixed acidity']+ red_wine_df['citric acid']+ red_wine_df['volatile acidity'])
In [42]:
plt.figure(figsize=(10,6))

sn.boxplot(x='quality', y='total acidity', data=red_wine_df, hue='quality', legend=False, palette='mako')
Out[42]:
<AxesSubplot: xlabel='quality', ylabel='total acidity'>

There is no clear trend or relationship here.¶
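
The total acidity column used in this step is a row-wise sum of the three acid columns. As a small sketch, the same sum can be written with sum(axis=1) over a list of column names, which is easier to extend. The frame below holds the first three rows of the red wine data, and the names acids and acid_columns are illustrative.

```python
import pandas as pd

# First three rows of the red wine data, acid columns only.
acids = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8, 7.8],
        "volatile acidity": [0.70, 0.88, 0.76],
        "citric acid": [0.00, 0.00, 0.04],
    }
)

acid_columns = ["fixed acidity", "volatile acidity", "citric acid"]

# Row-wise sum over the listed columns; equivalent to adding them one by one.
total_acidity = acids[acid_columns].sum(axis=1)
print(total_acidity.round(2).tolist())  # [8.1, 8.68, 8.6]
```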

l) The relation between pH and Total Acidity.¶

In [43]:
plt.figure(figsize=(10,6))

sn.regplot(x='pH', y='total acidity', data=red_wine_df)
Out[43]:
<AxesSubplot: xlabel='pH', ylabel='total acidity'>

It shows a negative correlation: the more acid a wine contains, the lower its pH.¶

In [44]:
g=sn.FacetGrid(red_wine_df, col='quality')
g=g.map(sn.regplot, 'total acidity','pH')

1. For the higher quality wines, the negative correlation is much stronger than for the lower quality wines.¶

2. Note also that the lower quality levels contain fewer samples.¶

In [45]:
correlation_coefficient, p_value = pearsonr(red_wine_df['pH'], red_wine_df['total acidity'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.6834838221663891
P-value: 1.442656031677709e-220
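
The per-quality comparison drawn from the FacetGrid above can also be expressed as numbers by computing the correlation within each quality group, together with the group size. This is only a sketch on hypothetical values (not the real data); on the notebook's frame the loop would run over red_wine_df.groupby('quality').

```python
import pandas as pd

# Hypothetical values, for illustration only (not the real data).
sample = pd.DataFrame(
    {
        "quality": [5, 5, 5, 5, 7, 7, 7, 7],
        "pH": [3.5, 3.3, 3.4, 3.2, 3.5, 3.4, 3.3, 3.2],
        "total acidity": [8.0, 8.4, 9.0, 9.1, 7.5, 8.0, 8.5, 9.0],
    }
)

rows = []
for quality, group in sample.groupby("quality"):
    rows.append(
        {
            "quality": quality,
            "n": len(group),  # small groups give unstable estimates
            "r": group["pH"].corr(group["total acidity"]),
        }
    )

per_quality = pd.DataFrame(rows).set_index("quality")
print(per_quality)
```

Reporting n next to r makes it easier to judge how much weight a group's coefficient deserves.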

B) White Wine Analysis 🍸¶

Showing the first five rows

In [46]:
white_wine_df.head()
Out[46]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6

Showing the last five rows

In [47]:
white_wine_df.tail()
Out[47]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6
4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5
4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6
4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7
4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6

Showing the dataset shape

In [48]:
white_wine_df.shape
Out[48]:
(4898, 12)

Showing the total information of the dataset

In [49]:
white_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

Checking Null Values

In [50]:
white_wine_df.isnull().sum()
Out[50]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Showing the duplicates

In [51]:
white_wine_df.duplicated().sum()
Out[51]:
937

Showing the number of unique values per column

In [52]:
white_wine_df.nunique()
Out[52]:
fixed acidity            68
volatile acidity        125
citric acid              87
residual sugar          310
chlorides               160
free sulfur dioxide     132
total sulfur dioxide    251
density                 890
pH                      103
sulphates                79
alcohol                 103
quality                   7
dtype: int64

Descriptive Statistics

In [53]:
white_wine_df.describe()
Out[53]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000

Analysis over White Wine 📉¶

a) Let's check at the Quality Column¶

In [54]:
white_wine_df['quality'].value_counts().plot(kind='bar', figsize=(10,6), color=sn.color_palette('magma'))
Out[54]:
<AxesSubplot: >

1. As you can see, most wines have a quality score of 6.¶

2. On the other hand, very few wines reach quality 9, the best score, while quality 3 is the worst.¶

b) Let's check which of the other columns are highly correlated to Quality using pairplot¶

In [113]:
sn.pairplot(white_wine_df, hue='quality')
Out[113]:
<seaborn.axisgrid.PairGrid at 0x240c2e5bf70>

c) Visualizing the Correlations using Heatmap¶

In [56]:
sn.set(rc={'figure.figsize':(11,7)})
In [57]:
sn.heatmap(white_wine_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
Out[57]:
<AxesSubplot: >
  • Free sulfur dioxide and total sulfur dioxide have some positive correlation with residual sugar. On further inspection, I found that the quantity of SO2 depends on the sugar content.
  • Chlorides, density and volatile acidity have a weak negative correlation with quality.
  • Alcohol has a positive correlation with quality.
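
Reading pairs off a heatmap by eye is error-prone, so the strongest pairs can also be listed directly from the correlation matrix. The sketch below does this on eight white wine rows shown in the head and tail outputs above, reduced to three columns; with so few rows the coefficients are illustrative only.

```python
import numpy as np
import pandas as pd

# Eight rows from the white wine head/tail outputs, three columns (illustrative).
sample = pd.DataFrame(
    {
        "residual sugar": [20.7, 1.6, 6.9, 8.5, 8.0, 1.2, 1.1, 0.8],
        "density": [1.0010, 0.9940, 0.9951, 0.9956, 0.99490, 0.99254, 0.98869, 0.98941],
        "alcohol": [8.8, 9.5, 10.1, 9.9, 9.6, 9.4, 12.8, 11.8],
    }
)

corr = sample.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format, ordered by the absolute strength of the correlation.
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```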

d) Visualizing the Correlations with Respect to Quality Column¶

In [135]:
white_wine_df.corr()['quality'].plot(kind='bar', figsize=(15,8))
Out[135]:
<AxesSubplot: >
In [58]:
sn.histplot(white_wine_df['alcohol'], kde=True)
Out[58]:
<AxesSubplot: xlabel='alcohol', ylabel='Count'>

Alcohol content is positively skewed.¶

Skewness

In [59]:
from scipy.stats import skew
skew(white_wine_df['alcohol'])
Out[59]:
0.48719273327634327

The mean

In [60]:
white_wine_df['alcohol'].mean()
Out[60]:
10.514267047774602

The median

In [61]:
white_wine_df['alcohol'].median()
Out[61]:
10.4

e) Showing the relationship between Quality and Alcohol¶

In [116]:
sn.boxplot(x='quality', y='alcohol', data=white_wine_df, hue='quality', legend=False, palette='Set2')
Out[116]:
<AxesSubplot: xlabel='quality', ylabel='alcohol'>

1) As quality improves, the alcohol content tends to be higher, so it shows a positive relation.¶

2) The correlation between alcohol and quality is not strong, though.¶

In [63]:
correlation_coefficient, p_value = pearsonr(white_wine_df['alcohol'], white_wine_df['pH'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: 0.12143209874912966
P-value: 1.4900595881932524e-17
In [117]:
sn.boxplot(x='quality', y='pH', data=white_wine_df, hue='quality', legend=False, palette='Set3')
Out[117]:
<AxesSubplot: xlabel='quality', ylabel='pH'>

f) The relationship between Alcohol and Density.¶

In [118]:
joint_plot=sn.jointplot(x='alcohol', y='density', data=white_wine_df, kind='reg', palette='mako')

It shows a negative correlation between alcohol and density.¶

In [66]:
correlation_coefficient, p_value = pearsonr(white_wine_df['alcohol'], white_wine_df['density'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.7801376214255598
P-value: 0.0
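
The printed P-value of 0.0 is almost certainly floating-point underflow rather than an exact zero: the true value is simply smaller than the smallest number a float can hold. A rough sketch of why, using the coefficient and row count reported above and the standard t statistic for a Pearson correlation:

```python
import math

# Values reported in the outputs above.
r = -0.7801376214255598
n = 4898  # number of white wine rows

# t statistic for testing whether a Pearson correlation differs from zero.
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t))  # -87
```

A t statistic of this size with several thousand degrees of freedom corresponds to a probability far below what double precision can represent, so it is safer to report it as p < 1e-300 than as p = 0.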
In [67]:
g=sn.FacetGrid(white_wine_df, col='quality')
g=g.map(sn.regplot, 'pH','alcohol')

g) Let's analyze sulphates and quality¶

In [119]:
sn.boxplot(x='quality', y='sulphates', data=white_wine_df, hue='quality', legend=False, palette='mako')
Out[119]:
<AxesSubplot: xlabel='quality', ylabel='sulphates'>
In [120]:
sn.boxplot(x='quality', y='total sulfur dioxide', data=white_wine_df, hue='quality', legend=False, palette='magma')
Out[120]:
<AxesSubplot: xlabel='quality', ylabel='total sulfur dioxide'>
In [70]:
correlation_coefficient, p_value = pearsonr(white_wine_df['quality'], white_wine_df['total sulfur dioxide'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.17473721759706368
P-value: 6.991898124258417e-35

This is a weak negative correlation.¶

In [121]:
sn.boxplot(x='quality', y='free sulfur dioxide', data=white_wine_df, hue='quality', legend=False, palette='husl')
Out[121]:
<AxesSubplot: xlabel='quality', ylabel='free sulfur dioxide'>
In [72]:
correlation_coefficient, p_value = pearsonr(white_wine_df['quality'], white_wine_df['free sulfur dioxide'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: 0.008158067123436157
P-value: 0.5681271459219848

There is essentially no correlation here.¶

In [122]:
sn.boxplot(x='quality', y='volatile acidity', data=white_wine_df, hue='quality', legend=False, palette='colorblind')
Out[122]:
<AxesSubplot: xlabel='quality', ylabel='volatile acidity'>
In [74]:
correlation_coefficient, p_value = pearsonr(white_wine_df['quality'], white_wine_df['volatile acidity'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.19472296892113533
P-value: 4.673261270702513e-43

There is a weak negative correlation here.¶

h) Relation between Residual Sugar and Density¶

In [75]:
joint_plot=sn.jointplot(x='residual sugar', y='density', data=white_wine_df, kind='reg')

This is a strong positive correlation.¶

i) Create a new column total acidity¶

In [123]:
white_wine_df['total acidity']=(white_wine_df['fixed acidity']+white_wine_df['citric acid']+ white_wine_df['volatile acidity'])

sn.boxplot(x='quality', y='total acidity', data=white_wine_df, hue='quality', legend=False, palette='Set1')
Out[123]:
<AxesSubplot: xlabel='quality', ylabel='total acidity'>

There is no clear correlation here.¶

In [77]:
correlation_coefficient, p_value = pearsonr(white_wine_df['quality'], white_wine_df['total acidity'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.13137720684953472
P-value: 2.6507804041318808e-20

j) Let's move on to citric acid¶

In [124]:
sn.boxplot(x='quality', y='citric acid', data=white_wine_df, hue='quality', legend=False, palette='Set3')
Out[124]:
<AxesSubplot: xlabel='quality', ylabel='citric acid'>
In [79]:
joint_plot=sn.jointplot(x='pH', y='citric acid', data=white_wine_df, kind='reg')
In [80]:
correlation_coefficient, p_value = pearsonr(white_wine_df['pH'], white_wine_df['citric acid'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)
Pearson correlation coefficient: -0.16374821140062382
P-value: 8.783728611505257e-31

This is a weak negative correlation.¶

k) Finally let's check Relation between Residual Sugar and Quality¶

In [126]:
sn.boxplot(x='quality', y='residual sugar', data=white_wine_df, hue='quality', legend=False, palette='Set2')
Out[126]:
<AxesSubplot: xlabel='quality', ylabel='residual sugar'>
In [127]:
white_wine_df['Crisp Ratio']=white_wine_df['total acidity'] / white_wine_df['residual sugar']

sn.boxplot(x='quality', y='Crisp Ratio', data=white_wine_df, hue='quality', legend=False, showfliers=False, palette='dark')
Out[127]:
<AxesSubplot: xlabel='quality', ylabel='Crisp Ratio'>
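
The Crisp Ratio is total acidity divided by residual sugar, so values above 1 mean the acids outweigh the sugar. Because the notebook adds the derived columns in place, they also show up in every later corr() and describe() call. One possible alternative, sketched here on the first three white wine rows, is to build them on a copy with assign(); the names white_sample and with_features are illustrative.

```python
import pandas as pd

# First three rows of the white wine data, reduced to the columns needed here.
white_sample = pd.DataFrame(
    {
        "fixed acidity": [7.0, 6.3, 8.1],
        "volatile acidity": [0.27, 0.30, 0.28],
        "citric acid": [0.36, 0.34, 0.40],
        "residual sugar": [20.7, 1.6, 6.9],
    }
)

# assign() returns a new frame, so white_sample keeps its original columns.
# Later entries may refer to columns created by earlier ones.
with_features = white_sample.assign(
    **{
        "total acidity": lambda d: d["fixed acidity"] + d["citric acid"] + d["volatile acidity"],
        "Crisp Ratio": lambda d: d["total acidity"] / d["residual sugar"],
    }
)
print(with_features[["total acidity", "Crisp Ratio"]].round(3))
```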

l) After analyzing the white wine, we can say that:¶

1. This is not a sweet wine.¶

2. There is a good amount of acidity present.¶

3. Total acidity is actually overpowering the residual sugar.¶


C) Comparative Analysis of White Wine and Red Wine Data 🍸🍷¶

In [83]:
red_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
 12  total acidity         1599 non-null   float64
dtypes: float64(12), int64(1)
memory usage: 162.5 KB
In [84]:
white_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
 12  total acidity         4898 non-null   float64
 13  Crisp Ratio           4898 non-null   float64
dtypes: float64(13), int64(1)
memory usage: 535.8 KB
In [85]:
red_wine_df.describe()
Out[85]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality total acidity
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023 9.118433
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569 1.832708
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000 5.270000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000 7.827500
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000 8.720000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000 10.070000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000 17.045000
In [86]:
white_wine_df.describe()
Out[86]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality total acidity Crisp Ratio
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909 7.467220 2.532658
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639 0.887962 2.249911
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000 4.130000 0.142325
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000 6.890000 0.776772
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000 7.405000 1.384058
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000 7.960000 4.256250
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000 14.960000 15.483333
  1. Residual sugar is considerably higher in White wines than in Red wines (mean 6.39 vs 2.54).
  2. Sulfur dioxide, both free and total, is considerably higher in White wines than in Red wines.
  3. Density is more or less the same in White and Red wine.
  4. pH, sulphates, and alcohol are roughly similar for both wines.
  5. The same goes for Quality.
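
The comparison above was read off the two `describe()` tables by eye. A minimal sketch of putting the means side by side is shown here; `compare_means` is a hypothetical helper (not part of the notebook), and the two small frames only hold a few rows copied from the outputs displayed in this notebook so that the snippet runs on its own.

```python
import pandas as pd

# A few rows copied from the outputs shown in this notebook (not the full data).
red_sample = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.9],
    "total sulfur dioxide": [34.0, 67.0, 54.0, 60.0, 34.0],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978],
})
white_sample = pd.DataFrame({
    "residual sugar": [1.6, 8.0, 1.2, 1.1, 0.8],
    "total sulfur dioxide": [92.0, 168.0, 111.0, 110.0, 98.0],
    "density": [0.99114, 0.99490, 0.99254, 0.98869, 0.98941],
})


def compare_means(red, white):
    """Hypothetical helper: means of the shared numeric columns, side by side."""
    shared = [c for c in red.columns if c in white.columns]
    out = pd.DataFrame({
        "Red": red[shared].select_dtypes("number").mean(),
        "White": white[shared].select_dtypes("number").mean(),
    })
    # A ratio above 1 means the White mean is larger than the Red mean.
    out["White / Red"] = out["White"] / out["Red"]
    return out


summary = compare_means(red_sample, white_sample)
print(summary.round(3))
```

On the full data the same call would be `compare_means(red_wine_df, white_wine_df)`; columns that exist in only one frame (such as Crisp Ratio) are skipped.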

A) Now, let's talk about Heatmaps

In [87]:
sn.heatmap(red_wine_df.corr(), annot=True, cmap='viridis',fmt='.2f')
Out[87]:
<AxesSubplot: >

i) For Red wine, with respect to the Quality column:

a) Positive Correlation -->

1. Alcohol
2. Fixed Acidity
3. Sulphates
4. Citric Acid

b) Negative Correlation -->

1. Volatile Acidity
2. Total Sulfur Dioxide
3. Density
4. Chlorides
In [88]:
sn.heatmap(white_wine_df.corr(), annot=True, cmap='viridis', fmt='.2f')
Out[88]:
<AxesSubplot: >

ii) For White wine, with respect to the Quality column:

a) Positive Correlation -->

1. Alcohol
2. pH (weak)

b) Negative Correlation -->

1. Volatile Acidity
2. Total Sulfur Dioxide
3. Density
4. Chlorides
5. Residual Sugar (weak)
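
The lists above were read off the heatmaps. As a rough sketch, the same ranking can be pulled out of the correlation matrix directly. `rank_quality_correlations` is a hypothetical helper, and the five rows are the Red wine rows displayed by `head()`; they are far too few to reproduce the real coefficients and only show the shape of the call.

```python
import pandas as pd

# Five Red wine rows copied from the head() output (illustration only).
sample = pd.DataFrame({
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})


def rank_quality_correlations(df, target="quality"):
    """Hypothetical helper: Pearson correlation of each numeric column with
    `target`, sorted from most negative to most positive."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values()


ranked = rank_quality_correlations(sample)
print(ranked.round(2))
```

Calling `rank_quality_correlations(red_wine_df)` or `rank_quality_correlations(white_wine_df)` would give one sorted column per wine type, which is easier to scan than the full heatmap.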

B) Combine the datasets

In [89]:
red_wine_df['type']='Red'
white_wine_df['type']='White'
In [90]:
wines_df=pd.concat([red_wine_df, white_wine_df])
In [91]:
wines_df
Out[91]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality total acidity type Crisp Ratio
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 8.10 Red NaN
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 5 8.68 Red NaN
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 5 8.60 Red NaN
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 6 12.04 Red NaN
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 5 8.10 Red NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6 6.70 White 4.187500
4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5 7.28 White 0.910000
4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6 6.93 White 5.775000
4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7 6.09 White 5.536364
4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6 6.59 White 8.237500

6497 rows × 15 columns

In [92]:
wines_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  total acidity         6497 non-null   float64
 13  type                  6497 non-null   object 
 14  Crisp Ratio           4898 non-null   float64
dtypes: float64(13), int64(1), object(1)
memory usage: 812.1+ KB
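
Two details in the `info()` output are worth noting: the index runs from 0 to 4897 even though there are 6497 rows, so index labels repeat, and `Crisp Ratio` has only 4898 non-null values because that column exists only in the White wine frame. The sketch below reproduces both effects on a tiny example (values taken from the rows displayed above) and shows `ignore_index=True` as one possible way to get a unique index. This is an optional variation, not what the notebook does.

```python
import pandas as pd

# Tiny stand-ins for the two frames, using values shown in the outputs above.
red_sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_sample = pd.DataFrame({
    "alcohol": [11.2, 9.6],
    "quality": [6, 5],
    "Crisp Ratio": [4.1875, 0.91],  # only the White frame has this column
})
red_sample["type"] = "Red"
white_sample["type"] = "White"

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
combined = pd.concat([red_sample, white_sample], ignore_index=True)

# Count the missing Crisp Ratio values per wine type.
missing_by_type = combined["Crisp Ratio"].isna().groupby(combined["type"]).sum()
print(missing_by_type)
```

Any later step that uses `Crisp Ratio` on the combined frame has to account for the missing Red wine values.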

Comparative Analysis 📉💹

C) For the Quality Column

In [93]:
sn.countplot(x='quality', hue='type', data=wines_df)
Out[93]:
<AxesSubplot: xlabel='quality', ylabel='count'>

1) The dataset contains far more White wine samples (4898) than Red wine samples (1599).

2) This imbalance explains why the White wine bars dominate the count plot.

3) Quality 9 does not occur in Red wine; it appears only in White wine.
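
Because the two groups differ so much in size, raw counts are hard to compare. One option, sketched here under the assumption that shares per wine type are what we want to see, is to normalize the counts within each type with `pd.crosstab`. The ten rows are the quality values of the Red and White rows displayed earlier in this notebook.

```python
import pandas as pd

# Quality values of the rows displayed earlier (5 Red, 5 White).
sample = pd.DataFrame({
    "type": ["Red"] * 5 + ["White"] * 5,
    "quality": [5, 5, 5, 6, 5, 6, 5, 6, 7, 6],
})

# normalize="index" turns each row (wine type) into shares that sum to 1.
shares = pd.crosstab(sample["type"], sample["quality"], normalize="index")
print(shares)
```

With the full data, `pd.crosstab(wines_df['type'], wines_df['quality'], normalize='index')` gives the share of each quality score within Red and within White, independent of sample size.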

D) Plotting a Density Plot for more clarity

In [95]:
p1=sn.kdeplot(red_wine_df['quality'], fill=True, color='r', label='red wine')
p2=sn.kdeplot(white_wine_df['quality'], fill=True, color='b', label='white wine')

plt.legend()
plt.show()

The White wine curve overlaps the Red wine curve, so the two quality distributions are broadly similar.

E) For the Quality, Alcohol and Density Columns

In [96]:
sn.boxplot(x='quality', y='alcohol', hue='type', data=wines_df, palette=['r','w'])
Out[96]:
<AxesSubplot: xlabel='quality', ylabel='alcohol'>

1) Overall, alcohol content seems to be a little higher for White wine than for Red wine.

2) The exception is Quality 8, where Red wine is slightly higher.

In [97]:
sn.boxplot(x='quality', y='density', hue='type', data=wines_df, palette=['r','w'], showfliers=False)
Out[97]:
<AxesSubplot: xlabel='quality', ylabel='density'>

a) The plot shows a negative correlation: higher-quality wines tend to have lower density.

b) Red wine also has a higher density than White wine.
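
The boxplot shows the medians graphically. A minimal sketch of reading the same medians as numbers is given below, using only the rows displayed earlier in this notebook, so the values are illustrative and not the real medians of the full data.

```python
import pandas as pd

# Rows copied from the outputs above: 5 Red and 5 White wines.
sample = pd.DataFrame({
    "type": ["Red"] * 5 + ["White"] * 5,
    "quality": [5, 5, 5, 6, 5, 6, 5, 6, 7, 6],
    "density": [0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
                0.99114, 0.99490, 0.99254, 0.98869, 0.98941],
})

# One row per quality score, one column per wine type.
median_density = (
    sample.groupby(["quality", "type"])["density"].median().unstack("type")
)
print(median_density)
```

Replacing `sample` with `wines_df` gives the median density for every quality score and wine type; a combination with no samples shows up as NaN.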

F) Plotting a Joint Plot of Alcohol and Residual Sugar

In [98]:
sn.jointplot(x='alcohol', y='residual sugar', data=wines_df, hue='type')
Out[98]:
<seaborn.axisgrid.JointGrid at 0x240be27df90>
In [99]:
sn.boxplot(x='quality', y='residual sugar', hue='type', data=wines_df, palette=['r','w'], showfliers=False)
Out[99]:
<AxesSubplot: xlabel='quality', ylabel='residual sugar'>

As you can see, Red wine has much less residual sugar than White wine.

G) The Distributions of Residual Sugar for Both Wines

In [100]:
p1=sn.kdeplot(red_wine_df['residual sugar'], fill=True, color='r', label='red wine')
p2=sn.kdeplot(white_wine_df['residual sugar'], fill=True, color='b', label='white wine')

plt.legend()
plt.show()
In [101]:
sn.regplot(x='alcohol', y='residual sugar', data=wines_df)
Out[101]:
<AxesSubplot: xlabel='alcohol', ylabel='residual sugar'>

As you can see, the plot shows a negative relationship between Alcohol and Residual Sugar.

H) Next, we analyze the Sulfur Dioxide and Sulphates Columns
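
The regression line gives the direction of the relationship but not its strength. A rough sketch of quantifying it with the Pearson correlation, overall and per wine type, follows. The rows are the ten displayed earlier in this notebook, which is far too few to draw conclusions from; the snippet only illustrates the calls.

```python
import pandas as pd

# Rows copied from the outputs above: 5 Red and 5 White wines.
sample = pd.DataFrame({
    "type": ["Red"] * 5 + ["White"] * 5,
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 11.2, 9.6, 9.4, 12.8, 11.8],
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.9, 1.6, 8.0, 1.2, 1.1, 0.8],
})

# Pearson correlation over all rows.
overall_r = sample["alcohol"].corr(sample["residual sugar"])

# The same coefficient computed separately for each wine type.
per_type_r = {
    name: group["alcohol"].corr(group["residual sugar"])
    for name, group in sample.groupby("type")
}

print(round(overall_r, 3))
print({name: round(r, 3) for name, r in per_type_r.items()})
```

Computing the coefficient per type matters here because the combined frame mixes two groups with very different sugar levels, and a pooled trend can differ from the trend inside each group.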

In [102]:
sn.boxplot(x='quality', y='total sulfur dioxide', hue='type', data=wines_df, palette=['r','w'])
Out[102]:
<AxesSubplot: xlabel='quality', ylabel='total sulfur dioxide'>
In [103]:
sn.boxplot(x='quality', y='free sulfur dioxide', hue='type', data=wines_df, palette=['r','w'], showfliers=False)
Out[103]:
<AxesSubplot: xlabel='quality', ylabel='free sulfur dioxide'>

As you can see, there is a large difference between Red wine and White wine: both total and free sulfur dioxide are much higher in White wine.

In [104]:
sn.boxplot(x='quality', y='sulphates', hue='type', data=wines_df, palette=['r','w'],showfliers=False)
Out[104]:
<AxesSubplot: xlabel='quality', ylabel='sulphates'>

1) As you can see, the quantity of sulphates is greater in Red wine.

2) Red wine, on the other hand, has less sulfur dioxide than White wine, so the two wine types show opposite patterns for sulphates and sulfur dioxide.

I) Now Comparing with Citric Acid

In [105]:
sn.boxplot(x='quality', y='citric acid', hue='type', data=wines_df, palette=['r','w'],showfliers=False)
Out[105]:
<AxesSubplot: xlabel='quality', ylabel='citric acid'>

The plot shows a positive relationship between citric acid and quality, most clearly for Red wine.

In [107]:
sn.boxplot(x='quality', y='chlorides', hue='type', data=wines_df, palette=['r','w'],showfliers=False)
Out[107]:
<AxesSubplot: xlabel='quality', ylabel='chlorides'>

As you can see, chlorides show a negative relationship with quality.

J) Now we Combine the three Acidities

In [108]:
wines_df['total acidity']=wines_df['fixed acidity'] + wines_df['volatile acidity'] + wines_df['citric acid']

sn.boxplot(x='quality', y='total acidity', hue='type', data=wines_df,
          palette=['r','w'], showfliers=False)
Out[108]:
<AxesSubplot: xlabel='quality', ylabel='total acidity'>

K) Conclusion -->

a) Red wine seems to be more acidic overall than White wine, based on the combined total acidity.

b) One possible explanation is tannins, a distinctive component of red wines that contributes to their characteristic taste and is much less prominent in white wines. This remains a hypothesis, since the dataset has no tannin column to confirm it.
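
As a sanity check on conclusion a), the sketch below rebuilds the total acidity column from its three parts and compares the medians per wine type. It uses only the ten rows displayed earlier in this notebook, so the numbers illustrate the calculation and are not the medians of the full data.

```python
import pandas as pd

# Rows copied from the outputs above: 5 Red and 5 White wines.
sample = pd.DataFrame({
    "type": ["Red"] * 5 + ["White"] * 5,
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4, 6.2, 6.6, 6.5, 5.5, 6.0],
    "volatile acidity": [0.70, 0.88, 0.76, 0.28, 0.70,
                         0.21, 0.32, 0.24, 0.29, 0.21],
    "citric acid": [0.00, 0.00, 0.04, 0.56, 0.00,
                    0.29, 0.36, 0.19, 0.30, 0.38],
})

acid_cols = ["fixed acidity", "volatile acidity", "citric acid"]

# Row-wise sum of the three acidity columns.
sample["total acidity"] = sample[acid_cols].sum(axis=1)

# Median total acidity per wine type.
median_acidity = sample.groupby("type")["total acidity"].median()
print(median_acidity.round(2))
```

Note that `sum(axis=1)` skips missing values, whereas adding the columns with `+` would return NaN for a row with a gap; the wine data has no nulls in these columns, so both give the same result here.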